FIGURE 5.8
Outliers over X̃, γ, and X′ of LayerNorm on BERT-SST-2. For example, at dimension 308, both γ and X̃ have sharp values; once γ is excluded, X′ exhibits a milder distribution than X̃.
FIGURE 5.9
The distribution on RoBERTa-QNLI, with (mean + 3 * std) drawn as the left border; clipping values are then enumerated to cut the tensor. The percentages reflect the proportion of clipped tokens.
5.6.2
Gamma Migration
Specifically, gamma migration produces a more quantization-friendly model by migrating the outlier amplifier γ into subsequent modules through an equivalent transformation, yielding activations that are more robust to quantization without any extra computation burden. As shown in Fig. 5.10, γ is excluded from the LayerNorm and moved to the shortcut branch and the weight of the next layer. The LayerNorm thus becomes a Non-scaling LayerNorm, and the shortcut branch and the weight of the next layer absorb the parameter γ. In Fig. 5.10, the “Quant” process quantizes X′, and the quantized output then feeds two branches: the first is the matrix multiplication on the bottom branch; the second multiplies by the parameter γ and passes through the “DeQuant” process. In effect, the γ calculation is merely delayed from the LayerNorm to the shortcut branch, so the new design does not increase the computation overhead.
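To make the equivalent transformation concrete, below is a minimal PyTorch sketch, assuming the standard decomposition LN(x) = γ ⊙ (x − μ)/√(σ² + ε) + β followed by an nn.Linear layer; the names NonScalingLayerNorm and migrate_gamma are illustrative rather than code from this chapter. The Non-scaling LayerNorm outputs X′ = (x − μ)/√(σ² + ε) + β/γ, so γ ⊙ X′ reproduces the original LayerNorm output; γ is then absorbed into the next layer's weight for the matrix-multiplication branch and kept as an explicit multiplier on the shortcut branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonScalingLayerNorm(nn.Module):
    """LayerNorm with gamma migrated out: returns (x - mean)/std + beta/gamma,
    so that gamma * output equals the original LayerNorm output."""
    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.normalized_shape = ln.normalized_shape
        self.eps = ln.eps
        # Fold beta as beta / gamma so gamma can be factored out of the output.
        self.register_buffer("bias", (ln.bias / ln.weight).detach())

    def forward(self, x):
        x = F.layer_norm(x, self.normalized_shape, None, None, self.eps)
        return x + self.bias

def migrate_gamma(ln: nn.LayerNorm, next_linear: nn.Linear):
    """Equivalent transformation: strip gamma from the LayerNorm, absorb it
    into the next layer's weight, and return it for the shortcut branch."""
    gamma = ln.weight.detach().clone()
    # (gamma * x') @ W.T == x' @ (W * gamma).T: scale each input channel of W.
    next_linear.weight.data.mul_(gamma)
    return NonScalingLayerNorm(ln), gamma

# Equivalence check with illustrative sizes.
ln, fc = nn.LayerNorm(768), nn.Linear(768, 768)
x = torch.randn(4, 768)
ref_shortcut, ref_matmul = ln(x), fc(ln(x))
ns_ln, gamma = migrate_gamma(ln, fc)             # fc.weight is rescaled in place
x_prime = ns_ln(x)                               # this is what "Quant" would see
print(torch.allclose(ref_matmul, fc(x_prime), atol=1e-5))         # bottom branch
print(torch.allclose(ref_shortcut, gamma * x_prime, atol=1e-5))   # shortcut branch
```

The two final checks mirror the two branches in Fig. 5.10: the bottom branch recovers the original matrix multiplication, and the shortcut branch recovers the original LayerNorm output, so nothing is computed beyond relocating γ.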
FIGURE 5.10
Left: the original quantization flow. Right: the flow after gamma migration.